
Complex Dtype Support for Hashmap Algos #36482

Merged
merged 44 commits into from
Sep 4, 2021

Conversation

alimcmaster1
Member

@alimcmaster1 alimcmaster1 commented Sep 19, 2020

Ref: #18009

Based on #27599

First set of tests for complex number handling + sensible results from functions that rely on hash tables.

Use generic object hashing for now.

@jbrockmendel you interested in reviewing?

@alimcmaster1 alimcmaster1 added the Complex Complex Numbers label Sep 19, 2020
@@ -0,0 +1,129 @@
import numpy as np
Member Author

Any better locations for this test file?

Contributor

Can you split the tests into the appropriate files, for example pandas/tests/series/methods/test_value_counts.py?

@pandas-dev pandas-dev deleted a comment from pep8speaks Sep 19, 2020
@alimcmaster1 alimcmaster1 changed the title ENH: Complex Dtype Support for Hashmap Algos WIP: Complex Dtype Support for Hashmap Algos Sep 19, 2020
@jbrockmendel
Member

do we have a list of what algorithms this is used for? if they are all unique/factorize-like, we might be able to do a .view(floatlike) to avoid an object cast

@WillAyd
Member

WillAyd commented Sep 21, 2020

This seems OK to me to start - could probably optimize later with things like .view(floatlike) though there may be some drawbacks to doing that as well
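
The `.view(floatlike)` idea raised in this thread can be sketched in plain NumPy. This is only an illustration of the technique, not the code in this PR: it assumes a contiguous complex128 input and uses `np.unique` in place of pandas' hash tables, and the variable names are hypothetical.

```python
import numpy as np

# Hypothetical input; any contiguous complex128 array would do.
values = np.array([1 + 1j, 3j, 1 + 1j, 1, 3j], dtype=np.complex128)

# Reinterpret each complex128 as a (real, imag) pair of float64 values.
# This is a view on the same memory, so no object cast is needed.
pairs = values.view(np.float64).reshape(-1, 2)

# Deduplicate the (real, imag) rows, then reinterpret them as complex128.
unique_pairs = np.ascontiguousarray(np.unique(pairs, axis=0))
uniques = unique_pairs.view(np.complex128).ravel()

print(uniques)
```

One possible drawback of this approach is that NaN values and signed zeros may need separate handling, since they are no longer compared as complex numbers.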

@github-actions github-actions bot added the Stale label Oct 22, 2020
@jbrockmendel
Member

I'm finding in a mostly-unrelated branch that having actual support for complex (in particular complex128) would be tremendously helpful. Just gentle encouragement.

Contributor

@jreback jreback left a comment

pls rebase as well

@alimcmaster1
Member Author

Sure will take a look @jreback

@alimcmaster1
Member Author

> I'm finding in a mostly-unrelated branch that having actual support for complex (in particular complex128) would be tremendously helpful. Just gentle encouragement.

Sure, which branch was this? This doesn't offer complex128 support, but it makes a lot of basic functionality work. I can perhaps look into that in a follow-up, @jbrockmendel.

@alimcmaster1
Member Author

fixing this up today.

@pandas-dev pandas-dev deleted a comment from pep8speaks Aug 26, 2021
@alimcmaster1
Member Author

> There are no test cases with not-a-number values, like np.nan, 1j*np.nan, np.nan+1j, 1j+np.nan and np.nan+1j*np.nan (best all at once). These are often tricky and are easily overlooked - thus always a good idea to have some unit tests with them.

Good idea, I added some tests for this in test_duplicates.py.

I also added tests for complex64 and complex128 in test_duplicates.py, test_reductions.py and test_value_counts.py.

I think as a follow-up we can address the inferred_type issue in complex indexes:

In [23]: pd.Series([3, 2, 1], index=pd.Index([3j, 1 + 1j, 1])).index.inferred_type
Out[23]: 'mixed-integer'

(I would expect this to be 'complex'.)

In [24]: pd.Series([3, 2, 1], index=pd.Index([3j, 1 + 1j, 1], dtype=np.complex128)).index.inferred_type
Out[24]: 'complex'
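
The not-a-number combinations listed in this comment can be spelled out in NumPy as follows. This is a minimal sketch of what such test inputs look like, not the tests added in test_duplicates.py; the array name is hypothetical.

```python
import numpy as np

# The five NaN spellings mentioned above, collected in one complex128 array.
values = np.array(
    [np.nan, 1j * np.nan, np.nan + 1j, 1j + np.nan, np.nan + 1j * np.nan],
    dtype=np.complex128,
)

# np.isnan is True for a complex number if either component is NaN.
print(np.isnan(values))

# A value with a NaN component never compares equal, not even to itself.
print(values == values)
```

Because none of these values compares equal to itself, functions that rely on hash tables have to decide explicitly whether such entries count as duplicates.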

@alimcmaster1 alimcmaster1 added this to the 1.4 milestone Aug 26, 2021
@jreback
Contributor

jreback commented Aug 26, 2021

cc @realead if you can look

Contributor

@realead realead left a comment

@alimcmaster1 Please delete outdated comment, otherwise lgtm.

],
)
def test_value_counts_complex_numbers(self, input_array, expected):
    # Complex Index dtype is cast to object
Contributor

Is this comment still valid? IIUC value_counts uses complex128/256 and not objects, see

cpdef value_count(ndarray[htfunc_t] values, bint dropna):

Member Author

The dtype of the index will be object, see below. Agreed, this probably needs fixing. I can create a follow-up, as this is the same issue that your comment below refers to: #36482 (comment)

In [14]: pd.Series([1 + 1j, 1 + 1j, 1, 3j, 3j, 3j]).value_counts().index
Out[14]: Index([3j, (1+1j), (1+0j)], dtype='object')
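
The behaviour shown above can be checked without relying on the index dtype. This is a small sketch under the assumption that `value_counts` accepts complex input, as this PR intends; the dtype of the resulting index (object or complex128) may depend on the pandas version, so it is printed rather than assumed.

```python
import pandas as pd

counts = pd.Series([1 + 1j, 1 + 1j, 1, 3j, 3j, 3j]).value_counts()

# value_counts sorts by frequency, so the most common value comes first.
most_common = counts.index[0]

# Printed rather than asserted, since it may differ between pandas versions.
print(counts.index.dtype)
print(most_common, counts.iloc[0])
```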

Contributor

pls create a followon issue

[
(
[1 + 1j, 0, 1, 1j, 1 + 2j, 1 + 2j],
np.array([(1 + 1j), 0j, (1 + 0j), 1j, (1 + 2j)], dtype=object),
Contributor

Hmm, dtype=object here... I wonder whether we should have a unique function with fused types, as we already do for other functions, e.g. value_counts:

cpdef value_count(ndarray[htfunc_t] values, bint dropna):

What do you think, @jbrockmendel? This probably should not be part of this PR, though. It looks like dtype=object for factorize and Index is just a consequence of unique returning objects.

Member Author

@jbrockmendel - whenever you have time, I think your eyes on this would be much appreciated.

Member

yah i think ideally this should return a complex dtype (same for factorize above). OK for that to be a separate PR, can leave a comment on the test to that effect

Member Author

Great, I have added comments to that effect and will create a follow-up issue.

@alimcmaster1 alimcmaster1 requested a review from realead August 31, 2021 22:09
@alimcmaster1
Member Author

alimcmaster1 commented Sep 2, 2021

Restarting CI; the test failures are unrelated:

Could not find conda environment: pandas-dev
You can list all discoverable environments with `conda info --envs`.

Error: Process completed with exit code 1.

All comments are addressed; this should be good.

@alimcmaster1
Member Author

/azp run

@azure-pipelines
Contributor

Azure Pipelines successfully started running 1 pipeline(s).

Member

@jbrockmendel jbrockmendel left a comment

LGTM

@alimcmaster1
Member Author

> LGTM

Appreciated the review :)

@alimcmaster1 alimcmaster1 requested a review from jreback September 3, 2021 20:39
Contributor

@jreback jreback left a comment

lgtm, thanks a lot @alimcmaster1

if you can create the followon (checkboxes if needed) would be great

@jreback jreback merged commit d08a792 into pandas-dev:master Sep 4, 2021
feefladder pushed a commit to feefladder/pandas that referenced this pull request Sep 7, 2021
Labels
Complex Complex Numbers
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Functions that rely on hash tables are incorrect for complex numbers
7 participants